Abstract

Down syndrome (DS) is the most common genetic cause of intellectual disability (ID); however, there is currently no treatment. In this study we used the Ts65Dn mouse model of DS, which is trisomic for orthologs of 55% of the classical protein-coding genes on human chromosome 21. To investigate potentially differentially expressed genes, we compared 8 classes of mice defined by their genotype, behavior, and treatment.

Introduction

Down syndrome, also known as trisomy 21, is a genetic disorder caused by the presence of all or part of a third copy of chromosome 21. It is usually associated with physical growth delays, mild to moderate intellectual disability, and characteristic facial features. The average IQ of a young adult with Down syndrome is 50, equivalent to the mental ability of an eight- or nine-year-old child. The parents of the affected individual are usually genetically normal. The probability increases from less than 0.1% in 20-year-old mothers to 3% in those aged 45. The extra chromosome is believed to occur by chance, with no known behavioral or environmental factor that changes the probability. Identifying genes that are differentially expressed between classes is a useful step toward developing new treatment strategies.

Methods

Libraries required for computational analysis in R

Our analysis relies on the following libraries. Please check that they are installed.

require(readxl)
require(ggplot2)
require(tidyr)
require(dplyr)
require(cowplot)
require(RColorBrewer)
require(car)
require(multcomp)
require(vegan)
require(ggfortify)
require(factoextra)
require(pca3d)
require(rgl)
require(plotly)
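If some of these packages are missing, a short helper can install them first. This is a sketch, assuming the packages are available from CRAN; the names are simply copied from the `require()` calls above.

```r
# Install any required packages that are not yet present (CRAN assumed)
pkgs <- c("readxl", "ggplot2", "tidyr", "dplyr", "cowplot", "RColorBrewer",
          "car", "multcomp", "vegan", "ggfortify", "factoextra", "pca3d",
          "rgl", "plotly")
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)
```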

Collecting and preformatting data

We downloaded the “Mice Protein Expression Data Set” from the UCI Machine Learning Repository in .xls format.

setwd('~/R_mouse')
Mice <- read_xls('Data_Cortex_Nuclear.xls')
Mice$class_factor <- as.factor(Mice$class)

Our data represent expression levels of 77 proteins measured in 72 mice. For each mouse there are multiple repeated measurements, so it is useful to add a new column identifying the individual mouse.

# sapply (rather than lapply) returns a character vector, not a list
Mice$ID <- sapply(strsplit(Mice$MouseID, '_'), function(x) x[1])
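The extraction can be checked on toy IDs (the values below are made up): keep the part before the underscore. Note that `sapply` returns a plain character vector here, which is more convenient than the list that `lapply` would return.

```r
# Toy check of the ID extraction (made-up IDs): keep the part before '_'
ids <- c("309_1", "309_2", "311_15")
mouse_id <- sapply(strsplit(ids, "_"), function(x) x[1])
mouse_id  # "309" "309" "311"
```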

We distinguished 8 classes of mice: c-CS-m, c-SC-m, c-CS-s, c-SC-s, t-CS-m, t-SC-m, t-CS-s, t-SC-s. According to the data description, their meanings are:

c-CS-s: control mice, stimulated to learn, injected with saline;

c-CS-m: control mice, stimulated to learn, injected with memantine;

c-SC-s: control mice, not stimulated to learn, injected with saline;

c-SC-m: control mice, not stimulated to learn, injected with memantine;

t-CS-s: trisomy mice, stimulated to learn, injected with saline;

t-CS-m: trisomy mice, stimulated to learn, injected with memantine;

t-SC-s: trisomy mice, not stimulated to learn, injected with saline;

t-SC-m: trisomy mice, not stimulated to learn, injected with memantine

Then we calculated the number of mice in each class:

unique(Mice[,c('ID','class_factor')]) %>% group_by(class_factor) %>% count()
## # A tibble: 8 x 2
## # Groups:   class_factor [8]
##   class_factor     n
##   <fct>        <int>
## 1 c-CS-m          10
## 2 c-CS-s           9
## 3 c-SC-m          10
## 4 c-SC-s           9
## 5 t-CS-m           9
## 6 t-CS-s           7
## 7 t-SC-m           9
## 8 t-SC-s           9

Next we looked at the number of measurements in the different groups. Since there are 15 repeated measurements for each mouse, at this step it is sufficient to count unique mice (disregarding the repeats). We represented the data in three different ways:

ggplot(data = unique(Mice[,c('ID','class_factor', 'Treatment', 'Genotype', 'Behavior')]), aes(x = class_factor,fill = Genotype)) + geom_bar() + theme_bw() +xlab('Class') + ylab('Amount of mice') + ggtitle('Amount of mice in different classes with respect of genotype') + scale_fill_brewer(palette = "Set2") 

The bar charts show that our groups are not balanced: there are more mice with the control genotype.

Tr <- ggplot(data = unique(Mice[,c('ID', 'Treatment', 'Genotype', 'Behavior')]), aes(x = Genotype,fill = Treatment)) + geom_bar() + theme_bw() +xlab('Genotype') + ylab('Amount of mice') + ggtitle('Amount of mice with different\ngenotypes with respect of\ntreatment') + scale_fill_brewer(palette = "Set2") 

Beh <- ggplot(data = unique(Mice[,c('ID','Treatment', 'Genotype', 'Behavior')]), aes(x = Genotype, fill = Behavior)) + geom_bar() + theme_bw() +xlab('Genotype') + ylab('Amount of mice') + ggtitle('Amount of mice with different\ngenotypes with respect of\nbehavior') + scale_fill_brewer(palette = "Set2") 

plot_grid(Tr, Beh, nrow = 1)

It is also important to calculate the amount of missing (NA) values in our data. Only about 51.1% of the rows are free of NAs, so deleting all incomplete rows would discard too much data; instead, we will decide how to handle NAs separately in each subsequent analysis.
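The share of complete rows can be computed with `complete.cases()`; a minimal sketch on a toy data frame:

```r
# Percentage of rows without any NA, shown on a small toy data frame
toy <- data.frame(a = c(1, NA, 3, 4), b = c(1, 2, NA, 4))
pct_complete <- mean(complete.cases(toy)) * 100
pct_complete  # 50: two of the four rows are NA-free
```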

Results

1)

First, we compared the expression level of BDNF_N between classes. To drop NAs only in the BDNF_N column (while keeping ID, Treatment, and class), we created a smaller data frame:

BDNF_N_analysis <- na.omit(Mice[,c('ID', 'BDNF_N', 'Treatment', 'class_factor')])

This preserves rows that have NAs in other columns, leaving us 1077 rows out of 1080. As a first step, we graphically compared BDNF_N expression across classes:

ggplot(data = BDNF_N_analysis, aes(x = class_factor, y = BDNF_N, fill = Treatment )) + geom_boxplot() + scale_fill_brewer(palette = "Set2") + theme_bw() +xlab('Classes') + ylab('BDNF_N expression level') + ggtitle('BDNF_N gene expression level')

First, we used ANOVA to look for significant differences between classes.

model <- lm(BDNF_N ~ class_factor, BDNF_N_analysis)
model_anova <- Anova(model)
model_anova
## Anova Table (Type II tests)
## 
## Response: BDNF_N
##               Sum Sq   Df F value    Pr(>F)    
## class_factor 0.28784    7  18.816 < 2.2e-16 ***
## Residuals    2.33619 1069                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Significant differences in BDNF_N expression level between groups were detected (F = 18.816, p < 2.2e-16, df1 = 7, df2 = 1069). Of note, ANOVA has applicability conditions: normally distributed residuals, homogeneity of residual variance, no collinearity (group independence), and random, independent observations within groups.
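These assumptions can also be checked with formal tests; below is a sketch on simulated data (not the mouse data) using base R.

```r
# Formal assumption checks on simulated data
set.seed(1)
g <- factor(rep(letters[1:3], each = 20))
y <- rnorm(60) + as.numeric(g)
fit <- lm(y ~ g)
shapiro.test(residuals(fit))  # H0: residuals are normally distributed
bartlett.test(y ~ g)          # H0: group variances are equal
# car::leveneTest(y ~ g)      # alternative, more robust to non-normality
```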

mod_diag <- fortify(model)
ggplot(mod_diag, aes(x = 1:nrow(mod_diag), y = .cooksd)) + geom_bar(stat = "identity", fill = 'chartreuse3') + ggtitle("Cook's distance graph ") + theme_bw() + xlab('Row number') + ylab('Cook distance')

Cook's distances are small, so no single observation unduly influences the model. We then checked the distribution of the residuals:

ggplot(mod_diag, aes(x = class_factor, y = .stdresid, fill = class_factor)) + geom_boxplot() + ggtitle("The distribution of residuals") + theme_bw() + scale_fill_brewer(palette = "Set2") + xlab("Class") + ylab("Residuals distribution")

The residual distributions do not differ markedly between groups, so ANOVA remains a suitable tool for our data even with some outliers.

To identify pairwise differences, Tukey's post-hoc test was performed.

model <- lm(BDNF_N ~ class_factor, BDNF_N_analysis)
post_hoch <- multcomp::glht(model, mcp(class_factor = "Tukey"))
result<-summary(post_hoch)
result
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: lm(formula = BDNF_N ~ class_factor, data = BDNF_N_analysis)
## 
## Linear Hypotheses:
##                        Estimate Std. Error t value Pr(>|t|)    
## c-CS-s - c-CS-m == 0  0.0030979  0.0055459   0.559   0.9993    
## c-SC-m - c-CS-m == 0 -0.0482717  0.0053980  -8.942    <0.01 ***
## c-SC-s - c-CS-m == 0 -0.0258249  0.0055459  -4.657    <0.01 ***
## t-CS-m - c-CS-m == 0 -0.0264852  0.0055459  -4.776    <0.01 ***
## t-CS-s - c-CS-m == 0 -0.0337570  0.0059483  -5.675    <0.01 ***
## t-SC-m - c-CS-m == 0 -0.0181541  0.0055459  -3.273   0.0240 *  
## t-SC-s - c-CS-m == 0 -0.0136310  0.0055790  -2.443   0.2208    
## c-SC-m - c-CS-s == 0 -0.0513696  0.0055459  -9.263    <0.01 ***
## c-SC-s - c-CS-s == 0 -0.0289228  0.0056900  -5.083    <0.01 ***
## t-CS-m - c-CS-s == 0 -0.0295831  0.0056900  -5.199    <0.01 ***
## t-CS-s - c-CS-s == 0 -0.0368549  0.0060829  -6.059    <0.01 ***
## t-SC-m - c-CS-s == 0 -0.0212520  0.0056900  -3.735    <0.01 ** 
## t-SC-s - c-CS-s == 0 -0.0167289  0.0057223  -2.923   0.0686 .  
## c-SC-s - c-SC-m == 0  0.0224468  0.0055459   4.047    <0.01 ** 
## t-CS-m - c-SC-m == 0  0.0217865  0.0055459   3.928    <0.01 ** 
## t-CS-s - c-SC-m == 0  0.0145147  0.0059483   2.440   0.2228    
## t-SC-m - c-SC-m == 0  0.0301176  0.0055459   5.431    <0.01 ***
## t-SC-s - c-SC-m == 0  0.0346406  0.0055790   6.209    <0.01 ***
## t-CS-m - c-SC-s == 0 -0.0006603  0.0056900  -0.116   1.0000    
## t-CS-s - c-SC-s == 0 -0.0079321  0.0060829  -1.304   0.8972    
## t-SC-m - c-SC-s == 0  0.0076708  0.0056900   1.348   0.8798    
## t-SC-s - c-SC-s == 0  0.0121939  0.0057223   2.131   0.3953    
## t-CS-s - t-CS-m == 0 -0.0072718  0.0060829  -1.195   0.9331    
## t-SC-m - t-CS-m == 0  0.0083311  0.0056900   1.464   0.8260    
## t-SC-s - t-CS-m == 0  0.0128542  0.0057223   2.246   0.3246    
## t-SC-m - t-CS-s == 0  0.0156029  0.0060829   2.565   0.1697    
## t-SC-s - t-CS-s == 0  0.0201260  0.0061130   3.292   0.0226 *  
## t-SC-s - t-SC-m == 0  0.0045231  0.0057223   0.790   0.9936    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

The most significant differences were revealed in the following pairs (the statistic reported is the estimated difference in means, not an F value):

c-SC-m & c-CS-m (estimate = -0.0482717, p < 0.001)

c-SC-s & c-CS-m (estimate = -0.0258249, p < 0.001)

t-CS-m & c-CS-m (estimate = -0.0264852, p < 0.001)

t-CS-s & c-CS-m (estimate = -0.0337570, p < 0.001)

c-SC-m & c-CS-s (estimate = -0.0513696, p < 0.001)

c-SC-s & c-CS-s (estimate = -0.0289228, p < 0.001)

t-CS-m & c-CS-s (estimate = -0.0295831, p < 0.001)

t-CS-s & c-CS-s (estimate = -0.0368549, p < 0.001)

t-SC-m & c-SC-m (estimate = 0.0301176, p < 0.001)

t-SC-s & c-SC-m (estimate = 0.0346406, p < 0.001)

Interestingly, no significant difference was revealed between t-SC-m & c-SC-s, suggesting that memantine can partially normalize BDNF_N expression.
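The same kind of pairwise comparison can be reproduced with base R's `TukeyHSD()`; a sketch on simulated groups (not the mouse data):

```r
# Base-R Tukey HSD on simulated data: estimates, CIs and adjusted p-values
set.seed(2)
d <- data.frame(g = factor(rep(c("A", "B", "C"), each = 15)),
                y = c(rnorm(15, 0), rnorm(15, 1), rnorm(15, 0)))
tk <- TukeyHSD(aov(y ~ g, data = d))
tk$g  # one row per pair: diff, lwr, upr, p adj
```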

2)

Next, we tried to build a linear model predicting the level of the ERBB4_N protein from the data on the other 76 proteins in the experiment. First, let us count the NAs per protein. There are too many to simply exclude every incomplete row, so a more careful approach is needed.

FIND_NA <- as.data.frame(is.na(Mice))
Result <- as.data.frame(apply(FIND_NA, 2, sum))
Result
##                 apply(FIND_NA, 2, sum)
## MouseID                              0
## DYRK1A_N                             3
## ITSN1_N                              3
## BDNF_N                               3
## NR1_N                                3
## NR2A_N                               3
## pAKT_N                               3
## pBRAF_N                              3
## pCAMKII_N                            3
## pCREB_N                              3
## pELK_N                               3
## pERK_N                               3
## pJNK_N                               3
## PKCA_N                               3
## pMEK_N                               3
## pNR1_N                               3
## pNR2A_N                              3
## pNR2B_N                              3
## pPKCAB_N                             3
## pRSK_N                               3
## AKT_N                                3
## BRAF_N                               3
## CAMKII_N                             3
## CREB_N                               3
## ELK_N                               18
## ERK_N                                3
## GSK3B_N                              3
## JNK_N                                3
## MEK_N                                7
## TRKA_N                               3
## RSK_N                                3
## APP_N                                3
## Bcatenin_N                          18
## SOD1_N                               3
## MTOR_N                               3
## P38_N                                3
## pMTOR_N                              3
## DSCR1_N                              3
## AMPKA_N                              3
## NR2B_N                               3
## pNUMB_N                              3
## RAPTOR_N                             3
## TIAM1_N                              3
## pP70S6_N                             3
## NUMB_N                               0
## P70S6_N                              0
## pGSK3B_N                             0
## pPKCG_N                              0
## CDK5_N                               0
## S6_N                                 0
## ADARB1_N                             0
## AcetylH3K9_N                         0
## RRP1_N                               0
## BAX_N                                0
## ARC_N                                0
## ERBB4_N                              0
## nNOS_N                               0
## Tau_N                                0
## GFAP_N                               0
## GluR3_N                              0
## GluR4_N                              0
## IL1B_N                               0
## P3525_N                              0
## pCASP9_N                             0
## PSD95_N                              0
## SNCA_N                               0
## Ubiquitin_N                          0
## pGSK3B_Tyr216_N                      0
## SHH_N                                0
## BAD_N                              213
## BCL2_N                             285
## pS6_N                                0
## pCFOS_N                             75
## SYP_N                                0
## H3AcK18_N                          180
## EGR1_N                             210
## H3MeK4_N                           270
## CaNA_N                               0
## Genotype                             0
## Treatment                            0
## Behavior                             0
## class                                0
## class_factor                         0
## ID                                   0

It is now clear that, for a better prediction, the proteins with many missing values should be excluded: BAD_N, BCL2_N, pCFOS_N, H3AcK18_N, EGR1_N, H3MeK4_N.

exclude_names <- c("BAD_N", "BCL2_N", "pCFOS_N", "H3AcK18_N", "EGR1_N", "H3MeK4_N")
Mice_good <- Mice[,!colnames(Mice) %in% exclude_names]
Mice_good <- na.omit(Mice_good)

After this step we have 1047 rows, which is good: we kept more data than we potentially lost. To build a good linear model, let us first calculate a correlation matrix.

Mice_good_numeric <- Mice_good[,2:72]
Cor_matrix <- cor(Mice_good_numeric)

We then looked for proteins that are strongly correlated with each other (|r| > 0.7) and excluded them.

cor_vector <- vector()
Cor_matrix_without_ERBB4_N <- Cor_matrix[!row.names(Cor_matrix) %in% 'ERBB4_N', !colnames(Cor_matrix) %in% 'ERBB4_N']
for (i in 1:ncol(Cor_matrix_without_ERBB4_N)){
  for (j in 1:nrow(Cor_matrix_without_ERBB4_N)){
    if (abs(Cor_matrix_without_ERBB4_N[i,j]) > 0.7 & i != j){  # abs() must wrap the coefficient, not the comparison
      cor_vector <- c(cor_vector, i, j)
    }
  }
}

# proteins with these indices should be excluded
cor_vector <- unique(cor_vector)
proteins_exclude <- row.names(Cor_matrix_without_ERBB4_N[cor_vector,])
proteis_predictors <- colnames(Cor_matrix_without_ERBB4_N)[!colnames(Cor_matrix_without_ERBB4_N) %in% proteins_exclude]
proteis_predictors
##  [1] "pCAMKII_N"       "pNR2A_N"         "pRSK_N"          "APP_N"          
##  [5] "SOD1_N"          "pP70S6_N"        "P70S6_N"         "pGSK3B_N"       
##  [9] "pPKCG_N"         "CDK5_N"          "S6_N"            "ADARB1_N"       
## [13] "RRP1_N"          "BAX_N"           "nNOS_N"          "GFAP_N"         
## [17] "GluR3_N"         "GluR4_N"         "IL1B_N"          "P3525_N"        
## [21] "pCASP9_N"        "PSD95_N"         "SNCA_N"          "Ubiquitin_N"    
## [25] "pGSK3B_Tyr216_N" "SHH_N"           "SYP_N"           "CaNA_N"
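The double loop above can be written more compactly in vectorized form. A sketch on a toy correlation matrix (the data below are simulated, with one strongly correlated pair forced in):

```r
# Vectorized version of the |r| > 0.7 filter, on a toy matrix
set.seed(3)
m <- matrix(rnorm(100), ncol = 5)
colnames(m) <- paste0("p", 1:5)
m[, 2] <- m[, 1] + rnorm(20, sd = 0.1)  # force one strongly correlated pair
cm <- cor(m)
diag(cm) <- 0                           # ignore self-correlation
keep <- colnames(cm)[apply(abs(cm) <= 0.7, 2, all)]
keep  # p1 and p2 are dropped; the rest are likely kept (seed-dependent)
```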

Using these predictors (27 of the 28 entered the formula; GluR4_N was omitted), we fitted a linear model:

m1 <- lm(ERBB4_N ~ pCAMKII_N + pNR2A_N + pRSK_N + APP_N + SOD1_N + pP70S6_N + P70S6_N + pGSK3B_N + pPKCG_N + CDK5_N + S6_N + ADARB1_N + RRP1_N + BAX_N + nNOS_N + GFAP_N + GluR3_N + IL1B_N +  P3525_N + pCASP9_N + PSD95_N +SNCA_N + Ubiquitin_N +pGSK3B_Tyr216_N + SHH_N + SYP_N + CaNA_N, data = Mice_good)
summary(m1)
## 
## Call:
## lm(formula = ERBB4_N ~ pCAMKII_N + pNR2A_N + pRSK_N + APP_N + 
##     SOD1_N + pP70S6_N + P70S6_N + pGSK3B_N + pPKCG_N + CDK5_N + 
##     S6_N + ADARB1_N + RRP1_N + BAX_N + nNOS_N + GFAP_N + GluR3_N + 
##     IL1B_N + P3525_N + pCASP9_N + PSD95_N + SNCA_N + Ubiquitin_N + 
##     pGSK3B_Tyr216_N + SHH_N + SYP_N + CaNA_N, data = Mice_good)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.033468 -0.004725  0.000252  0.004740  0.036943 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.093e-02  4.989e-03   4.196 2.95e-05 ***
## pCAMKII_N       -8.197e-05  4.264e-04  -0.192 0.847592    
## pNR2A_N         -8.699e-03  3.019e-03  -2.882 0.004035 ** 
## pRSK_N           3.135e-02  7.065e-03   4.438 1.01e-05 ***
## APP_N           -6.121e-03  5.243e-03  -1.168 0.243263    
## SOD1_N           9.596e-04  1.458e-03   0.658 0.510641    
## pP70S6_N         5.460e-03  3.517e-03   1.553 0.120843    
## P70S6_N         -5.723e-03  2.952e-03  -1.939 0.052806 .  
## pGSK3B_N         6.790e-02  2.550e-02   2.663 0.007869 ** 
## pPKCG_N         -6.626e-03  9.320e-04  -7.110 2.19e-12 ***
## CDK5_N           2.381e-02  9.709e-03   2.452 0.014374 *  
## S6_N             1.477e-02  3.230e-03   4.574 5.38e-06 ***
## ADARB1_N         8.543e-03  1.093e-03   7.817 1.34e-14 ***
## RRP1_N          -2.348e-02  9.940e-03  -2.362 0.018375 *  
## BAX_N            7.933e-02  2.131e-02   3.722 0.000208 ***
## nNOS_N           5.520e-02  1.674e-02   3.296 0.001013 ** 
## GFAP_N          -4.910e-02  3.199e-02  -1.535 0.125105    
## GluR3_N         -6.620e-02  1.019e-02  -6.496 1.28e-10 ***
## IL1B_N           6.890e-02  6.462e-03  10.663  < 2e-16 ***
## P3525_N          1.003e-01  1.560e-02   6.433 1.92e-10 ***
## pCASP9_N         8.406e-03  1.993e-03   4.217 2.69e-05 ***
## PSD95_N          1.526e-02  1.760e-03   8.674  < 2e-16 ***
## SNCA_N          -5.001e-03  2.178e-02  -0.230 0.818478    
## Ubiquitin_N     -6.014e-04  3.228e-03  -0.186 0.852241    
## pGSK3B_Tyr216_N  1.496e-03  4.631e-03   0.323 0.746704    
## SHH_N           -7.175e-03  1.212e-02  -0.592 0.554033    
## SYP_N            2.673e-02  6.223e-03   4.295 1.91e-05 ***
## CaNA_N          -9.209e-03  2.138e-03  -4.307 1.82e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.007815 on 1019 degrees of freedom
## Multiple R-squared:  0.7422, Adjusted R-squared:  0.7354 
## F-statistic: 108.7 on 27 and 1019 DF,  p-value: < 2.2e-16

The model is quite good (adjusted R-squared: 0.7354). We tried to refine the model further, but the results were unsatisfactory (they can be inspected in the .Rmd file by changing the include value in the corresponding chunk).
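One common refinement strategy is automatic backward selection by AIC via `step()`. A sketch on the built-in `mtcars` data (not our model):

```r
# Backward AIC selection on a toy model with the built-in mtcars data
fit  <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
fit2 <- step(fit, direction = "backward", trace = 0)
formula(fit2)  # the reduced model retained by AIC
```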

Next, it is important to diagnose the resulting linear model. Let us plot Cook's distance, as before:

mod_diag_2 <- fortify(m1)
ggplot(mod_diag_2, aes(x = 1:nrow(mod_diag_2), y = .cooksd)) + geom_bar(stat = "identity", fill = 'chocolate1') + ggtitle("Cook's distance graph ") + theme_bw() + xlab('Row number') + ylab('Cook distance')

And also we plotted the distribution of residuals from the model:

predict_value <- m1$fitted.values
Mice_good$DIF <- predict_value - Mice_good$ERBB4_N
ggplot(data = Mice_good, aes (x = 1:nrow(Mice_good), y = DIF)) + geom_point() + theme_bw() + geom_hline(yintercept = 0, col = 'red') + xlab('Row number') + ylab('Difference between predicted and real value')

To conclude, this model performs well: there is no visible pattern in the residuals, adjusted R-squared is 0.7354, and the F-statistic is 108.7 on 27 and 1019 DF (p < 2.2e-16).

3)

We performed PCA to reduce dimensions:

proteins.pca <- prcomp(Mice_good_numeric, center = TRUE,scale. = TRUE)
summary(proteins.pca)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     4.4503 3.4072 2.8105 2.33899 1.86301 1.82428 1.55734
## Proportion of Variance 0.2789 0.1635 0.1113 0.07705 0.04888 0.04687 0.03416
## Cumulative Proportion  0.2789 0.4425 0.5537 0.63077 0.67965 0.72652 0.76068
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     1.45602 1.28359 1.11052 1.01509 0.94326 0.85416 0.84546
## Proportion of Variance 0.02986 0.02321 0.01737 0.01451 0.01253 0.01028 0.01007
## Cumulative Proportion  0.79054 0.81375 0.83112 0.84563 0.85816 0.86844 0.87851
##                           PC15    PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.79527 0.75263 0.70358 0.68280 0.64661 0.63958 0.61419
## Proportion of Variance 0.00891 0.00798 0.00697 0.00657 0.00589 0.00576 0.00531
## Cumulative Proportion  0.88741 0.89539 0.90236 0.90893 0.91482 0.92058 0.92589
##                           PC22    PC23   PC24    PC25    PC26    PC27    PC28
## Standard deviation     0.60263 0.56478 0.5461 0.53550 0.51218 0.48899 0.47211
## Proportion of Variance 0.00511 0.00449 0.0042 0.00404 0.00369 0.00337 0.00314
## Cumulative Proportion  0.93101 0.93550 0.9397 0.94374 0.94744 0.95080 0.95394
##                           PC29    PC30    PC31   PC32    PC33    PC34    PC35
## Standard deviation     0.44843 0.43174 0.42033 0.4128 0.40859 0.38049 0.37469
## Proportion of Variance 0.00283 0.00263 0.00249 0.0024 0.00235 0.00204 0.00198
## Cumulative Proportion  0.95677 0.95940 0.96189 0.9643 0.96664 0.96868 0.97066
##                           PC36    PC37    PC38    PC39    PC40    PC41    PC42
## Standard deviation     0.35654 0.34868 0.34328 0.33609 0.33136 0.32733 0.31632
## Proportion of Variance 0.00179 0.00171 0.00166 0.00159 0.00155 0.00151 0.00141
## Cumulative Proportion  0.97245 0.97416 0.97582 0.97741 0.97896 0.98047 0.98188
##                           PC43   PC44    PC45    PC46    PC47   PC48    PC49
## Standard deviation     0.29736 0.2917 0.28820 0.28157 0.27362 0.2671 0.25851
## Proportion of Variance 0.00125 0.0012 0.00117 0.00112 0.00105 0.0010 0.00094
## Cumulative Proportion  0.98312 0.9843 0.98549 0.98661 0.98766 0.9887 0.98961
##                           PC50    PC51    PC52    PC53    PC54    PC55    PC56
## Standard deviation     0.24793 0.24498 0.23645 0.22764 0.22103 0.21583 0.21045
## Proportion of Variance 0.00087 0.00085 0.00079 0.00073 0.00069 0.00066 0.00062
## Cumulative Proportion  0.99047 0.99132 0.99210 0.99283 0.99352 0.99418 0.99480
##                          PC57    PC58    PC59   PC60    PC61    PC62    PC63
## Standard deviation     0.2063 0.19942 0.19206 0.1891 0.18430 0.17125 0.16071
## Proportion of Variance 0.0006 0.00056 0.00052 0.0005 0.00048 0.00041 0.00036
## Cumulative Proportion  0.9954 0.99596 0.99648 0.9970 0.99746 0.99788 0.99824
##                           PC64    PC65    PC66    PC67    PC68   PC69    PC70
## Standard deviation     0.15930 0.15640 0.13731 0.12629 0.12347 0.1198 0.10325
## Proportion of Variance 0.00036 0.00034 0.00027 0.00022 0.00021 0.0002 0.00015
## Cumulative Proportion  0.99860 0.99894 0.99921 0.99943 0.99965 0.9999 1.00000
##                             PC71
## Standard deviation     1.976e-15
## Proportion of Variance 0.000e+00
## Cumulative Proportion  1.000e+00

To visualize the percentage of the total variation explained by the first 9 components, we created a plot:

Importance <- summary(proteins.pca)
DF_importance <- as.data.frame(t(as.matrix(Importance$importance)))
DF_importance$PC <- rownames(DF_importance)
DF_importance$Variance <- DF_importance$`Proportion of Variance`
ggplot(data=DF_importance[1:9,], aes(x = PC, y = Variance)) + geom_bar(stat = "identity") + theme_bw() + xlab("Principal component") + ylab("Percentage of the total variation")+ ggtitle("A percentage of the total variation\nfor the first 9 PC")
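The proportions plotted above come directly from the squared standard deviations of the components; a sketch on the built-in `iris` data:

```r
# Proportion of variance per principal component, built-in iris data
p <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
prop <- p$sdev^2 / sum(p$sdev^2)
round(prop, 3)  # PC1 alone explains about 73% of the variance
```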

To better distinguish the groups, we also plotted a 2D plot (PC1 vs PC2):

autoplot(proteins.pca, data = Mice_good, colour = 'class') + theme_bw() 

And an interactive 3D plot (PC1, PC2 and PC3):

plot_ly(x=proteins.pca$x[,1], y=proteins.pca$x[,2], z=proteins.pca$x[,3], type="scatter3d", mode="markers", color = Mice_good$class_factor)

With so many proteins, the factor loading plot is hard to read. The angles between the vectors reflect the correlations of the features with each other and with the principal component axes.

fviz_pca_var(proteins.pca,
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE 
             )